The outbreak of COVID-19 some time ago turned into a worldwide pandemic. Given this experience,
early prediction is important for preventing the spread of the disease if a pandemic occurs again.
The aim of this study is to compare the results of analyzing cumulative COVID-19 cases using
multiple linear regression (MLR), ridge regression (RR), and long short-term memory (LSTM) models,
with the islands of Java and Bali as a case study. We chose both islands because they are very
densely populated. These three models are among the most widely used time series-based prediction
models and have relatively high accuracy. The predictive variables are the number of cumulative
cases, the number of daily cases, and population density. The research data were taken from Kaggle
and processed using Google Colab. The data cover January 20, 2020, to August 8, 2020, and training
was carried out for 12 days. The results show that the LSTM is more accurate than the other models,
as seen in the accuracy value (99.8%) of the model test. The models were tested using R², mean
square error (MSE), and root mean square error (RMSE).
1. INTRODUCTION

Since COVID-19 reached Indonesia in early 2020, everyday activities have shifted to new habits.
In 2020, the islands of Java and Bali had the highest spread of COVID-19 cases in Indonesia. Java
and Bali are Indonesia's most significant business and tourism destinations. Judging from the
number of positive cases, the two islands contributed 67.76% of the total national cases, followed
in order by Sumatra, Kalimantan, Sulawesi, Nusa Tenggara, and Maluku-Papua. This is because Java and
Bali hold the largest share of Indonesia's population and the capital city is located on Java.
Indonesia has the fourth largest population in the world [1]. Its population has recently reached
more than 270 million people spread throughout the islands of Indonesia. Currently, the Indonesian
government is making every effort to suppress the rate of positive COVID-19 cases by boosting
vaccinations. The surge in COVID-19 patients also caused chaos in Indonesia. This situation should
have been predicted earlier so that a strategy for handling COVID-19 patients on Java and Bali
could be prepared. Prediction is one important way of helping the government and others map out and
prepare health services in the pandemic era. Prediction and machine learning are related through the
process-relational approach, which improves processes, data quality, and model quality [2]. Many
researchers have implemented prediction algorithms. One of the machine learning algorithms for
prediction is regression. Regression predicts the future by analyzing past events. It can be used to
minimize risks or impacts that will occur in the future, both short-term and long-term [3]. The
COVID-19 cumulative case data are time-series data because their patterns depend on the previous
time trend [4]. The aim of this research is to help the government anticipate COVID-19 outbreaks.
Information and communication between researchers and the government are essential for
decision-making and strategic planning by those handling the outbreak. Strategic planning covers
preparing hospital facilities, public education about the immune system of infected people, steps to
combat the proliferation of the virus, and so on, so that the information is complete [5]. Several
researchers compared linear regression and support vector machine (SVM) for machine learning-based
prediction of case numbers [6], [7] and found that linear regression is more accurate than the SVM
algorithm. However, simple linear regression contains only one predictor variable, whereas MLR can
include two or more variables in the prediction. For this reason, several studies use multiple
linear regression (MLR) for prediction. Rath et al. [8] compared a linear regression algorithm with
MLR for the prediction of COVID-19 cases in India. The model was evaluated using R² and the result
was 0.995. However, that study did not explain the case prediction results for a future period, so
it did not reveal the detailed accuracy of the model against historical data. Wahyuni et al. [9]
used the MLR model to predict COVID-19 cases in Indonesia. The model accuracy test used R² and the
result was 0.999. Another study that uses MLR is [10]. However, MLR is considered less suitable for
prediction because it often overfits, so some studies use the ridge regression (RR) algorithm to
avoid overfitting. RR applies regularization to the coefficients of the predictive variables,
keeping them as small as possible, so that no single predictive variable has an outsized effect on
the outcome variable. For this reason, some studies use RR to predict cases [11]-[13]. Liu compared
linear regression, logistic, and recurrent neural network (RNN) models for predicting the trend of
COVID-19 in the US. The comparison shows that the RNN is more accurate than the other two
models [14].

RNN is a generalization of a feed-forward neural network that has internal memory. RNN is
recurrent because it performs the same function for each input while the output for the current
input depends on previous computations. After the output is generated, it is copied and fed back
into the network. To make a decision, the network considers the current input and the output it has
learned from the previous input. RNNs are widely used in machine learning and deep learning to make
various predictions, including weather, stock prices [15], and electrocardiogram (ECG) recordings
[16]. RNNs are also considered capable of prediction based on time series data. One problem with
RNNs is the vanishing gradient [17], [18]. Long short-term memory (LSTM), which addresses this
problem, is considered suitable for predicting time series [19]. LSTM is a development of the RNN
model that adds a cell state, which stores time-series information over long periods [20], [21].
LSTM is often used to predict infectious diseases, including dengue [22]-[24], mouth and foot
disease [25], hepatitis [26], [27], chickenpox [28], malaria [29]-[31], tuberculosis [32], and
COVID-19. Several studies have applied LSTM to COVID-19 prediction. Indriani et al. [33] studied
LSTM models for COVID-19 prediction in Indonesia using 50 epochs and a lookback of 8, and stated
that the LSTM model is suitable for forecasting. Bedi et al. [34] compared the LSTM model with the
susceptible-exposed-infected-recovered-deceased (SEIRD) model for short-term prediction of the
outbreak of COVID-19 cases in India. The model was trained to predict the next 30 days, and the
results show that the LSTM model fits the COVID-19 cases in India better than the other model.
Iqbal et al. [35] used the LSTM model to predict COVID-19 cases in Bangladesh, and the results show
that the LSTM model is recommended as a prediction model because of its high accuracy, especially
in regression tasks. Rauf et al. [36] developed an LSTM model for prediction, and the results showed
that the LSTM achieved a higher accuracy, 99.525%, than other models. Other studies that used LSTM
for COVID-19 prediction are [37]-[43].

Based on the literature above, no study has directly compared the accuracy of the MLR, RR, and
LSTM models with the population density parameter added. In addition, none of the studies above
discusses the RR model for COVID-19 prediction. In this study, COVID-19 predictions are made using
machine learning-based RR, MLR, and LSTM models. These three models are considered to have very
high prediction accuracy. We compare them for predicting COVID-19 cases in a developing country,
Indonesia, using the population density variable on the islands of Java and Bali. The prediction
results are explained in detail as a recommendation for future decision-making. The population
density variable is a time series variable, so it is suitable for the three prediction models. The
rest of this paper is organized as follows: section 2 describes the research method, section 3
presents the results and discussion, section 4 gives the conclusion, and the references are in the
last part.
2. METHOD
2.1. Data set selection and processing 

The data were taken from Kaggle and cover January 20, 2020, to August 1, 2020. Training was
carried out on 7 days of data to see the trend in the number of COVID-19 cases on the islands of
Java and Bali over the next 4 days. The data were processed in Google Colab using three variables
in the machine learning models, namely cumulative cases, daily new cases, and population density on
the islands of Java and Bali. We divided the data set into two parts: 80% as training data and 20%
for model testing.
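
The 80/20 division can be illustrated with a minimal sketch. It assumes the split is chronological (the first 80% of days for training, the last 20% for testing), which the text does not state; the function name `chronological_split` and the toy arrays are hypothetical and are not taken from the Kaggle data.

```python
import numpy as np

def chronological_split(features, targets, train_fraction=0.8):
    """Split time-ordered arrays into a leading training part and a trailing test part."""
    features = np.asarray(features, dtype=float)
    targets = np.asarray(targets, dtype=float)
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same length")
    split_index = int(len(features) * train_fraction)
    return (features[:split_index], features[split_index:],
            targets[:split_index], targets[split_index:])

# Hypothetical toy data: 10 days, 2 features (daily new cases, population density).
days = np.arange(10)
X = np.column_stack([days * 3.0, np.full(10, 1100.0)])
y = np.cumsum(days * 3.0)  # cumulative cases built from the daily cases

X_train, X_test, y_train, y_test = chronological_split(X, y)
print(X_train.shape, X_test.shape)
```

Keeping the test block at the end of the series avoids training on days that come after the days being predicted; a shuffled split would be a different design choice.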


2.2. Multiple linear regression 

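The linear regression model is a simple prediction model in machine learning. It predicts using
only two variables, one independent and one dependent. MLR extends linear regression so that more
than two variables can be used [44], [45]. The MLR model is shown in (1).

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon \tag{1}$$

Taking the expectation of (1) gives (2).

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p \tag{2}$$

where $p$ is the number of independent variables, $y$ is the predicted (dependent) variable,
$x_1, \dots, x_p$ are the independent variables, $\beta_0, \dots, \beta_p$ are the coefficients, and
$\varepsilon$ is the error term [10], [46]. For a single independent variable $x$ observed $n$
times, the least-squares slope and intercept are given in (3) and (4).

$$\beta_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} \tag{3}$$

$$\beta_0 = \frac{\sum y - \beta_1 \sum x}{n} \tag{4}$$

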
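
A minimal sketch of fitting the model in (1) by ordinary least squares is given here for illustration. It uses NumPy's `numpy.linalg.lstsq`; the helper names `fit_mlr` and `predict_mlr` and the toy data are hypothetical, and this is not necessarily how the study's models were implemented.

```python
import numpy as np

def fit_mlr(X, y):
    """Return (intercept, coefficients) of the ordinary least-squares fit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # A leading column of ones lets the solver estimate the intercept beta_0.
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]

def predict_mlr(intercept, coefficients, X):
    """Evaluate beta_0 + beta_1 * x_1 + ... + beta_p * x_p for each row of X."""
    return intercept + np.asarray(X, dtype=float) @ coefficients

# Hypothetical noise-free data generated from y = 5 + 2*x1 + 0.5*x2.
X = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 9.0], [4.0, 15.0], [5.0, 11.0]])
y = 5.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1]

b0, b = fit_mlr(X, y)
print(b0, b)
print(predict_mlr(b0, b, [[6.0, 13.0]]))
```

Because the toy data contain no noise, the fit should recover the generating coefficients up to floating-point error.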
2.3. Ridge regression 

RR is a technique for stabilizing the regression coefficient estimates in the presence of
multicollinearity. Multicollinearity is a strong correlation between two or more independent
variables in a multiple regression model. The method is intended to overcome the ill-conditioning
caused by high correlation between several independent variables in the regression model. In this
case, the matrix that must be inverted in the least-squares solution is nearly singular, which makes
the estimated model parameters unstable [12]. RR is a modification of the least squares method that
produces a biased estimator of the regression coefficients [13]. It is a regression technique
designed for multiple regression data that exhibit multicollinearity. The RR cost function is
presented in (5).

$$\text{cost}(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \tag{5}$$

where $\hat{y}_i$ is the predicted value for observation $i$, $\beta_j$ are the slope coefficients,
and $\lambda \ge 0$ is the penalty parameter. If $\lambda = 0$, RR is equal to least squares
regression, and as $\lambda$ approaches infinity, all coefficients shrink to zero.
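
The two limiting cases of lambda can be checked numerically with a small sketch. It assumes the standard closed-form ridge solution on centered data with an unpenalized intercept; `fit_ridge` and the toy data are hypothetical and do not come from the study.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge fit on centered data; the intercept is not penalized."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + lam * I) beta = Xc^T yc for the slope coefficients.
    gram = Xc.T @ Xc + lam * np.eye(X.shape[1])
    beta = np.linalg.solve(gram, Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta

# Hypothetical data with two strongly correlated predictors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 6.0])
X = np.column_stack([x1, x2])
y = 3.0 + 1.5 * x1 + 0.5 * x2

_, beta_ols = fit_ridge(X, y, lam=0.0)     # lambda = 0: ordinary least squares
_, beta_mid = fit_ridge(X, y, lam=10.0)    # moderate shrinkage
_, beta_big = fit_ridge(X, y, lam=1e9)     # very large lambda: slopes near zero
print(beta_ols, beta_mid, beta_big)
```

Adding lambda to the diagonal is what keeps the matrix well away from singular when the predictors are highly correlated.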


2.4. LSTM 

Hochreiter and Schmidhuber [28] proposed LSTM to overcome the vanishing and exploding gradient
problem [47]. The memory of the LSTM cell is stored in the cell state and transformed from input to
output there. LSTM is a particular type of RNN that performs better on time series prediction [48].
Information entering the LSTM first passes through a sigmoid layer called the forget gate layer,
which decides how much of the previous cell state $C_{t-1}$ to keep: 0 means completely forgotten
and 1 means completely retained. This layer takes the previous hidden state $h_{t-1}$ and the new
input $x_t$ as its inputs and outputs a value between 0 and 1 [49], [50]. LSTM has four layers,
namely the forget gate (1), input gate (2), new cell state candidate (3), and output gate (4), in
the model loop shown in Figure 1.

Following Figure 1, the LSTM is defined by (6)-(11). The forget gate is given in (6).

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{6}$$

The input gate decides which new information can be transferred to the cell. It works alongside the
forget gate, which handles the information from the previous memory, and is defined as (7).


Indonesian J Elec Eng & Comp Sci, Vol. 28, No. 1, October 2022: 600-610, ISSN: 2502-4752


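$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{7}$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{8}$$

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \tag{9}$$

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{10}$$

$$h_t = o_t * \tanh(C_t) \tag{11}$$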


Figure 1. Looping in LSTM model 
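
To make the gate equations (6)-(11) concrete, the following is a minimal NumPy sketch of a single LSTM time step. The weight shapes, the random initialization, and the names `lstm_step` and `init_params` are assumptions made for illustration; the study's actual network, layer sizes, and training procedure are not specified here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following equations (6)-(11)."""
    concat = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])       # (6) forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])       # (7) input gate
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])   # (8) candidate state
    c_t = f_t * c_prev + i_t * c_tilde                          # (9) new cell state
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])       # (10) output gate
    h_t = o_t * np.tanh(c_t)                                    # (11) new hidden state
    return h_t, c_t

def init_params(input_size, hidden_size, seed=0):
    """Hypothetical small random weights and zero biases for the four layers."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("f", "i", "c", "o"):
        params[f"W_{gate}"] = rng.normal(0.0, 0.1, (hidden_size, hidden_size + input_size))
        params[f"b_{gate}"] = np.zeros(hidden_size)
    return params

params = init_params(input_size=3, hidden_size=4)
h, c = np.zeros(4), np.zeros(4)
# Hypothetical scaled inputs: three time steps with three features each.
sequence = np.array([[0.1, 0.2, 0.3], [0.2, 0.1, 0.4], [0.3, 0.3, 0.5]])
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)
```

The loop carries `h` and `c` from one time step to the next, which is the recurrence that Figure 1 depicts.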


2.5. Proposed framework 

In this study, three time series prediction algorithms were compared to obtain short-term
predictions. The accuracy of the three algorithms was then compared to find the best among them.
The research began by collecting the data set. The second step was to clean the data by eliminating
zero values. The data were then split into two parts, 80% training data and 20% testing data. The
next step was prediction using MLR, RR, and LSTM. The last step was evaluation using R², mean square
error (MSE), and root mean square error (RMSE). The proposed framework is described in Figure 2.

[Figure 2 flowchart. Recoverable steps: training scenario (split data into training and testing); MLR, RR, and LSTM]

Figure 2. Proposed framework
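
The cleaning step of the framework, eliminating zero values, might look like the sketch below. It assumes that rows are dropped when the cumulative case count is zero; the text does not say which column is checked, and `drop_zero_rows` and the sample values are hypothetical.

```python
import numpy as np

def drop_zero_rows(cumulative_cases, *other_columns):
    """Keep only the rows whose cumulative case count is non-zero."""
    cumulative_cases = np.asarray(cumulative_cases, dtype=float)
    keep = cumulative_cases != 0
    cleaned = [cumulative_cases[keep]]
    # Apply the same row mask to every accompanying column.
    cleaned.extend(np.asarray(column)[keep] for column in other_columns)
    return tuple(cleaned)

# Hypothetical values: the first two days precede the first recorded case.
cumulative = [0, 0, 2, 5, 9]
daily = [0, 0, 2, 3, 4]
clean_cumulative, clean_daily = drop_zero_rows(cumulative, daily)
print(clean_cumulative, clean_daily)
```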


Comparative analysis of time series prediction model for forecasting COVID-19 ... (Sri Ngudi Wahyuni) 




3. RESULT AND DISCUSSION 
3.1. Data set 

The data set was taken from Kaggle and has already been used in [9], [33]. The data are updated
by the Indonesian National Disaster Management Authority (BNPB). This study uses only three
parameters for predictive analysis, namely cumulative cases, daily new cases, and population
density.


3.2. Evaluation metric 
The R² score measures the relationship between the independent and dependent variables in the
regression model [51]. The R² formula is:

$$R^2 = \frac{SS_{regression}}{SS_{total}} \tag{14}$$

where $SS_{regression}$ is the regression sum of squares and $SS_{total}$ is the total sum of
squares of the data.

The second evaluation parameter is the mean square error (MSE), given in (15).

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{15}$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of
observations. The RMSE is the square root of the MSE.
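
As a rough illustration of the three metrics, the sketch below computes R² as the ratio of sums of squares in (14), the MSE in (15), and the RMSE as its square root. The function names and the toy values are hypothetical; the toy predictions come from the least-squares line y = 1.9x, because the ratio in (14) only behaves as a goodness-of-fit score for least-squares predictions.

```python
import numpy as np

def r2_from_sums_of_squares(y_true, y_pred):
    """R^2 as SS_regression / SS_total, as in (14)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    y_mean = y_true.mean()
    ss_regression = np.sum((y_pred - y_mean) ** 2)
    ss_total = np.sum((y_true - y_mean) ** 2)
    return ss_regression / ss_total

def mse(y_true, y_pred):
    """Mean square error, as in (15)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    """Root mean square error: the square root of the MSE."""
    return np.sqrt(mse(y_true, y_pred))

# Hypothetical data: x = 1..4, y observed, predictions from the fitted line y = 1.9 * x.
y_true = [2.0, 4.0, 5.0, 8.0]
y_pred = [1.9, 3.8, 5.7, 7.6]
print(r2_from_sums_of_squares(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

For a least-squares fit with an intercept, this ratio equals the more common form 1 - SS_error / SS_total; for other predictors, such as a neural network, the two forms can differ.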


3.3. Prediction result in Java and Bali Island 

We use data from the islands of Java and Bali because this area has the highest cumulative
number of new COVID-19 cases in Indonesia. Java and Bali are the largest tourist and business
destinations in Indonesia for both domestic and foreign visitors. Twelve days of data were used for
training, and we forecast four days, from July 6 to July 9, 2020. The experiment begins with feature
extraction to produce the input data, which are then fed into the model to obtain the time series
output. The final step is to compare the accuracy of the three models. The cumulative case
prediction results using RR are presented in Figure 3. In Figure 3, the RR predictions fall below
the actual data, although at the beginning of the prediction period they are close to the actual
values. The short-term prediction results of the RR model are presented in Table 1.

Figure 3. The Java and Bali Island prediction cases using the RR model

Table 1 shows the results of the 4-day short-term prediction using the RR model. The experiment
shows that the differences between the actual and predicted values are relatively small. However,
the differences for Bali Island are very large, which indicates that the RR model is not suitable
for predicting cumulative cases on Bali Island. The prediction results using MLR are presented in
Figure 4, which shows the predicted cumulative cases for Java and Bali. The MLR predictions are more
accurate than the RR predictions. The detailed prediction results are given in Table 2.

The predicted values in Table 2 are closer to the actual values, especially the predictions for
6 and 7 July 2021 for Bali Island. These values are closer than the RR predictions for the same
dates. Based on this result, we conclude that multiple linear regression has a better validation
level than the RR model. Next, we compare the results predicted with the LSTM model, presented in
Figure 5.


Table 1. Prediction results using the RR model

Region        Date          Actual value   RR predicted value
West Java     6 July 2021   7,239          7,400
West Java     7 July 2021   8,591          7,384
West Java     8 July 2021   7,772          7,369
West Java     9 July 2021   7,399          7,359
Middle Java   6 July 2021   4,048          4,531
Middle Java   7 July 2021   3,823          4,515
Bali Island   7 July 2021   504            660

Figure 4. The Java and Bali Island prediction cases using the MLR model

Table 2. Prediction results of cumulative COVID-19 cases using the MLR model

Region        Date          Actual value   MLR predicted value
West Java     6 July 2021   7,239          7,238
West Java     7 July 2021   8,591          8,591
West Java     8 July 2021   7,772          7,773
West Java     9 July 2021   7,399          7,398
Middle Java   6 July 2021   4,048          4,048
Middle Java   7 July 2021   3,823          3,822
Middle Java   8 July 2021   4,232          4,232
Middle Java   9 July 2021   4,530          4,580
East Java     6 July 2021   1,808          1,808
East Java     7 July 2021   2,548          2,549
East Java     8 July 2021   2,551          2,551
East Java     9 July 2021   2,530          2,530
Bali Island   6 July 2021   424            424
Bali Island   7 July 2021   504            504
Bali Island   8 July 2021   577            580
Bali Island   9 July 2021   674            720

[Figure 5 panel titles: West Java Cumulative Case - LSTM Graph; Middle Java Cumulative Case - LSTM Graph]

Figure 5. The Java and Bali Island prediction cases using the LSTM model


Figure 5 shows the prediction results using the LSTM model. The trend line in Figure 5 follows
the data more closely than those of the RR and MLR models: it rises and stays close to the actual
values. This is clearest in the graphs for East Java and Bali Island, where the predictions are
almost the same as the actual values. This means that the LSTM model is recommended for predicting
cumulative COVID-19 cases on Java and Bali in Indonesia. The details of the prediction results are
shown in Table 3.

Table 3 shows that the predictions of the LSTM model are closer to the actual values. This can
be seen in the predicted values for Bali Island on 8 and 9 July 2020. The predictions for these
dates improve on the results in Table 1 and Table 2, which is clearest for Bali Island, where the
LSTM outperforms the RR and MLR models. The detailed comparison of error values is shown in Table 4.

The error value in Table 4 is derived from the difference between the predicted value and the
actual value. Based on Table 4, the LSTM model is recommended for predicting cumulative cases in
Java and Bali. The results of the model test are presented in Table 5.
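
As a quick check of how the error values relate to the prediction tables, the short sketch below recomputes two entries. The formula (actual - predicted) / actual, expressed as a fraction rather than a percentage, is inferred from the tabulated numbers and is not stated in the text; `relative_error` is a hypothetical helper name.

```python
def relative_error(actual, predicted):
    """Signed error as a fraction of the actual value (inferred formula)."""
    return (actual - predicted) / actual

# West Java, RR model: actual and predicted values for 6 and 7 July from Table 1.
print(round(relative_error(7239, 7400), 4))  # -0.0222, the RR entry for 6 July in Table 4
print(round(relative_error(8591, 7384), 4))  # 0.1405, the RR entry for 7 July in Table 4
```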

Table 3. The comparison of the prediction result using LSTM model

Region        Date          Actual data   Predicted
West Java     6 July 2021   7,239         7,239
West Java     7 July 2021   8,591         8,592
West Java     8 July 2021   7,772         7,773
West Java     9 July 2021   7,399         7,400
Middle Java   6 July 2021   4,048         4,049
Middle Java   7 July 2021   3,823         3,824
Middle Java   8 July 2021   4,232         4,233
Middle Java   9 July 2021   4,530         4,531
East Java     6 July 2021   1,808         1,809
East Java     7 July 2021   2,548         2,549
East Java     8 July 2021   2,551         2,552
East Java     9 July 2021   2,530         2,531
Bali Island   6 July 2021   424           425
Bali Island   7 July 2021   504           505
Bali Island   8 July 2021   577           578
Bali Island   9 July 2021   674           575


Table 4. Comparison of error values in the RR, MLR, and LSTM models

Region        Date          RR error %   MLR error %   LSTM error %
West Java     6 July 2021   -0.0222      0.0001        0
West Java     7 July 2021   0.1405       0             -0.0001
West Java     8 July 2021   0.0519       -0.0001       -0.0001
West Java     9 July 2021   0.0054       0.0001        -0.0001
Middle Java   6 July 2021   -0.1193      0             -0.0002
Middle Java   7 July 2021   -0.181       0.0003        -0.0003
Middle Java   8 July 2021   -0.0633      0             -0.0002
Middle Java   9 July 2021   0.0088       -0.011        -0.0002
East Java     6 July 2021   0            0             -0.0006
East Java     7 July 2021   0            -0.0004       -0.0004
East Java     8 July 2021   0            0             -0.0004
East Java     9 July 2021   0            0             -0.0004
Bali Island   6 July 2021   -0.5825      0             -0.0024
Bali Island   7 July 2021   -0.3095      0             -0.002
Bali Island   8 July 2021   -0.1317      -0.0052       -0.0017
Bali Island   9 July 2021   0.0519       -0.0682       0.1469


Table 5. The evaluation results of the models

Model R² MSE RMSE
MLR 0.989 2.89375x10? 17.11x10° 
RR 0.988 3.97232x10? 3.972x10? 
LSTM 0.998 0.00937 107 96.82x107 


The model test values for R², MSE, and RMSE in Table 5 vary across the RR, MLR, and LSTM models,
and the LSTM model has the best test values of the three. This is clearly seen in its R² value, the
largest among the three models. Judging from the MSE and RMSE values, the LSTM model is also more
accurate than the other models. The LSTM model is therefore recommended for predicting infectious
diseases, especially COVID-19 on the islands of Java and Bali.


4. CONCLUSION 

Based on the experiments conducted to test the RR, MLR, and LSTM prediction models, clear
differences between the models were obtained. First, the predictions of the LSTM model were closer
to the actual values than those of the other models, as seen in the differences between predicted
and actual values. Second, the LSTM model has a lower prediction error rate than the other models.
Third, the model test using R², MSE, and RMSE shows that the LSTM model scores better than the other
models, so this model is recommended for predicting infectious diseases, especially COVID-19 in Java
and Bali. In future research, the LSTM model will be improved to predict COVID-19 cases globally so
that it can be used in all countries.